Every time you use an app, visit a website, click on a link, fill out a survey or even just scroll on your device, your data is being:
Collected - What you click, search, watch, like or buy
Analyzed - Used to predict your behaviour, interests or identity
Shared or Sold - Passed to advertisers, data brokers or other companies
Why Does This Matter?
You may be targeted with ads, content and potentially misinformation
You could be judged or profiled based on your data (even if it’s not accurate)
You rarely know who has your data (or what they’re doing with it)
So what does this mean for us? Let’s explore how data can be used, what makes certain information sensitive and why it matters.
Personally Identifiable Information (PII)
PII refers to any data that can be used to identify a specific individual.
Direct identifiers: These clearly and uniquely point to a person.
Examples: name, social security number, patient ID
Indirect identifiers: These don’t identify someone on their own, but could when combined.
Examples: age, DOB, postal code, race, sex
Personal Data
Data can be identifiable when:
It contains directly identifying information.
It’s possible to single out an individual.
It’s possible to infer information about an individual from other information in the dataset.
It’s possible to link records relating to an individual.
The de-identification can be reversed.
Scenario: Can This Data Identify You?
A fitness app shares anonymized data with researchers. The dataset includes:
Step count per day
General location (postal code)
Age
Time of day the user exercises
Health conditions
Separately, a publicly available dataset includes information from a local running club: names, age groups and 5K race times.
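To see how these two sources could be combined, here is a minimal sketch in R using dplyr. All column names and values are hypothetical stand-ins for the scenario above:

```r
# Sketch of linking two "anonymous" datasets on shared quasi-identifiers.
# The data are hypothetical; real attacks work the same way at scale.
library(dplyr)

fitness <- tibble(
  postal_code   = c("V5K", "V5K", "T2N"),
  age           = c(34, 29, 41),
  steps_per_day = c(12000, 8000, 15000)
)

running_club <- tibble(
  name        = c("A. Runner", "B. Jogger"),
  postal_code = c("V5K", "T2N"),
  age         = c(34, 41)
)

# Joining on postal code and age re-attaches names to the
# "anonymized" fitness records.
linked <- inner_join(fitness, running_club,
                     by = c("postal_code", "age"))
linked
```

Neither dataset identifies anyone on its own; the join does.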
The Mosaic Effect
The “Mosaic Effect” occurs when separate pieces of data, none of which identifies anyone on its own, are combined from different sources to reveal personal information or identify an individual.
In 2000, 87% of the United States population was found to be identifiable using a combination of their ZIP code, gender and date of birth.
Generalization
Replace specific values with broader, less precise categories
Examples:
Convert date of birth to age, or group into ranges
Replace address with town or region
Recategorize rare labels into “other” or “missing”
Abstract people or places in qualitative data (e.g., “Bob” to “[colleague]”)
Here we will show an example of generalization on the age column:
df_generalized <- df |>
  mutate(age_group = case_when(
    age < 30 ~ "under 30",
    TRUE ~ "30+"
  )) |>
  select(-age)
df_generalized
# A tibble: 4 × 3
name height_cm age_group
<chr> <dbl> <chr>
1 Joel Miller 182 30+
2 Ellie Williams 160 under 30
3 Tommy Miller 185 30+
4 Abby Anderson 173 under 30
Replacement
Swap identifying info with less informative alternatives
Examples:
Use pseudonyms for names (with securely stored keyfile)
Replace with placeholders (e.g., “[redacted]”)
Round numeric values
Creating Pseudonyms
Pseudonyms should reveal nothing about the subject
Good pseudonyms:
Are random or meaningless strings/numbers
Are securely managed (e.g., encrypted keyfile)
Can be generated using tools in Excel, R, Python, SPSS
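As a sketch, pseudonyms and a keyfile could be generated in R like this. The data and the four-digit code format are assumptions for illustration:

```r
# Sketch of pseudonym generation with a separate keyfile.
# Names and the "P####" format are hypothetical choices.
library(dplyr)

df <- tibble(name = c("Joel Miller", "Ellie Williams"))

set.seed(42)  # for reproducibility in this example only
keyfile <- tibble(
  name      = df$name,
  pseudonym = sprintf("P%04d", sample(1000:9999, nrow(df)))
)

# Release version: names replaced by meaningless codes
df_pseudonymized <- df |>
  left_join(keyfile, by = "name") |>
  select(-name)

# The keyfile linking pseudonyms back to names must be stored
# separately and securely (e.g., in an encrypted file).
```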
Noise Addition
Add small random perturbations to numeric values so the exact figures cannot be recovered
# A tibble: 4 × 3
name age height_cm_noisy
<chr> <dbl> <dbl>
1 Joel Miller 52 182.
2 Ellie Williams 19 160.
3 Tommy Miller 48 186.
4 Abby Anderson 28 174.
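A minimal sketch of how such a noisy column might be produced in R with dplyr. The noise scale (sd = 1) is an assumption; in practice it must be chosen to balance privacy and utility:

```r
# Sketch of noise addition: perturb a numeric column with random
# Gaussian noise. Data and noise scale are hypothetical.
library(dplyr)

df <- tibble(
  name      = c("Joel Miller", "Ellie Williams"),
  height_cm = c(182, 160)
)

set.seed(1)
df_noisy <- df |>
  mutate(height_cm_noisy = height_cm + rnorm(n(), mean = 0, sd = 1)) |>
  select(-height_cm)
df_noisy
```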
Permutation
Swap values between individuals
Makes it more difficult to link variables within a record
# A tibble: 4 × 3
name age height_cm_permuted
<chr> <dbl> <dbl>
1 Joel Miller 52 160
2 Ellie Williams 19 173
3 Tommy Miller 48 182
4 Abby Anderson 28 185
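A sketch of permutation in R using sample(); the data below are hypothetical:

```r
# Sketch of permutation: shuffle one column's values across rows so
# attributes can no longer be linked within a record.
library(dplyr)

df <- tibble(
  name      = c("Joel Miller", "Ellie Williams", "Tommy Miller"),
  height_cm = c(182, 160, 185)
)

set.seed(7)
df_permuted <- df |>
  mutate(height_cm_permuted = sample(height_cm)) |>
  select(-height_cm)
df_permuted
```

Note that the set of values is unchanged; only their assignment to individuals is scrambled.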
The Non-Insured Health Benefits (NIHB) database contains sensitive health data on First Nations use of services like prescriptions, dental care, and medical devices.
In 2001, Health Canada began releasing de-identified NIHB pharmacy claims data to Brogan Inc., a private health consulting firm.
Though personal identifiers were removed, community identifiers remained, and First Nations were not informed until 2007.
Brogan sold the data to pharmaceutical companies for commercial research and marketing.
Health Canada justified the release by claiming no privacy interests remained since personally identifying information had been removed.
Kukutai, T., & Taylor, J. (2016). Indigenous data sovereignty: Toward an agenda. ANU Press.
Discussion
Was the data truly de-identified?
What are the limits of simply removing names and IDs from a dataset?
How can we measure whether a dataset is truly “safe” to release?
Should de-identified data still require community consent before being shared or sold?
Why basic de-identification isn’t always enough
Individuals can often be re-identified using other information.
As datasets become more detailed and linkable, privacy risks increase.
More advanced statistical methods are often needed to ensure meaningful de-identification while preserving data utility.
Statistical approaches to de-identification
\(k\)-anonymity
\(l\)-diversity
Differential privacy (advanced)
Overview of privacy models
\(k\)-anonymity and \(l\)-diversity are statistical approaches that quantify the level of identifiability within a tabular dataset.
They focus on how variables, when combined, can lead to identification.
These approaches are complementary: a dataset can be simultaneously \(k\)-anonymous and \(l\)-diverse, where \(k\) and \(l\) represent numeric thresholds.
\(k\)-anonymity and \(l\)-diversity are typically used to de-identify tabular datasets before sharing.
They work best on relatively large datasets, where enough observations are present to preserve useful detail while still protecting privacy.
Identifiers, Quasi-Identifiers, and Sensitive Attributes
Privacy models distinguish between three types of variables:
Identifiers: Direct identifiers such as names, student numbers, email addresses.
Quasi-Identifiers: Indirect identifiers that can lead to identification when combined with other quasi-identifiers or external data.
Examples: age, sex, place of residence, physical characteristics, timestamps, etc.
Sensitive Attributes: Variables of interest that need protection and cannot be altered as they are key outcomes.
Examples: Medical condition, Income, etc.
Importance of Correct Variable Categorization
Correctly categorizing variables into identifiers, quasi-identifiers, and sensitive attributes is crucial.
This categorization determines how to de-identify your dataset effectively using \(k\)-anonymity, \(l\)-diversity, and \(t\)-closeness.
Now, let’s discuss each of these techniques in detail…
\(k\)-anonymity
A data set is \(k\)-anonymous if each observation cannot be distinguished from at least \(k-1\) other observations based on the quasi-identifiers.
This can be achieved through generalization, suppression and sometimes top- or bottom-coding of data values.
Applying \(k\)-anonymity makes it more difficult for an attacker to single out or re-identify specific individuals.
It also helps reduce the risk of the mosaic effect, where combining data points could lead to identification.
Making a data set \(k\)-anonymous
Identify variables as identifiers, quasi-identifiers and sensitive attributes.
Choose a value for \(k\).
Aggregate or transform the data so each combination of quasi-identifiers occurs at least \(k\) times.
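The check in step 3 can be sketched in R with dplyr: count how often each combination of quasi-identifiers occurs, then take the minimum. The toy data are hypothetical:

```r
# Sketch of checking k-anonymity: the dataset is k-anonymous for
# k equal to the smallest quasi-identifier group size.
library(dplyr)

df <- tibble(
  age_range = c("30-39", "30-39", "30-39", "40-49"),
  city      = c("Calgary", "Calgary", "Toronto", "Calgary")
)

k_counts <- df |>
  count(age_range, city)

k <- min(k_counts$n)
# Here k = 1 (two combinations occur only once), so this toy
# dataset is not even 2-anonymous and needs more generalization.
```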
Choosing \(k\)
There is no single correct value for \(k\)!
Higher \(k\) increases privacy, but reduces data detail and utility.
The choice depends on promises made to data subjects and acceptable risk levels.
Age and city are quasi-identifiers, and salary is considered a sensitive attribute.
Age  City       Salary
38   Calgary     91,000
37   Toronto     92,000
31   Vancouver   82,000
48   Calgary    115,000
39   Vancouver  118,000
37   Calgary     97,000
34   Toronto     98,000
33   Vancouver   89,000
32   Toronto    108,000
45   Calgary     95,000
\(k=2\)
Age Range  City       Salary Range
30–39      Calgary    90,000–99,999
30–39      Toronto    90,000–99,999
30–39      Vancouver  80,000–89,999
40–49      Calgary    110,000–119,999
30–39      Vancouver  110,000–119,999
30–39      Calgary    90,000–99,999
30–39      Toronto    90,000–99,999
30–39      Vancouver  80,000–89,999
30–39      Toronto    100,000–109,999
40–49      Calgary    90,000–99,999
Given the data, which field(s) could you generalize to help achieve k = 3 anonymity?
Age  ZIP Code  Disease
29   13053     Flu
27   13068     Flu
28   13068     Cold
45   14853     Diabetes
46   14853     Diabetes
47   14853     Cancer
A. Generalize Age into age ranges (e.g., 20–29, 40–49)
B. Suppress Disease entirely
C. Generalize ZIP Code to first 3 digits (e.g., 130, 148)
D. Generalize Age into age ranges (e.g., 20–29, 40–49) and ZIP code to first 3 digits (e.g., 130, 148)
E. It’s already \(k=3\) anonymous
Which of the following datasets violates \(k = 2\) anonymity?
Option A
Age  Sex  ZIP
34   M    02138
34   M    02138
34   F    02139
Option B
Age  Sex  ZIP
22   F    10011
22   F    10011
22   F    10011
Option C
Age Range  Sex  ZIP Prefix
30–39      *    021**
30–39      *    021**
30–39      *    021**
A. Only A
B. Only B
C. Only C
D. A and B
\(l\)-diversity
\(l\)-diversity is an extension of \(k\)-anonymity that ensures sufficient variation in a sensitive attribute.
This is important because if all individuals within a group share the same sensitive value, there is still a risk of inference.
Although these data are \(2\)-anonymous, we can still infer that any 30–39 year old from Calgary who participated earns between $90,000 and $99,999.
Age Range  City       Salary Range
30–39      Calgary    90,000–99,999
30–39      Toronto    90,000–99,999
30–39      Vancouver  80,000–89,999
40–49      Calgary    110,000–119,999
30–39      Vancouver  110,000–119,999
30–39      Calgary    90,000–99,999
30–39      Toronto    90,000–99,999
30–39      Vancouver  80,000–89,999
30–39      Toronto    100,000–109,999
40–49      Calgary    90,000–99,999
\(l\)-diversity
The approach requires at least \(l\) different values for the sensitive attribute within each combination of quasi-identifiers.
Again, there is no perfect value for \(l\) (typically \(1< l \leq k\)).
With \(l=2\), that means that for each combination of Age Range and City, there are at least 2 distinct Salary Ranges.
Age Range  City     Salary Range
30–39      -        90,000–99,999
30–39      -        90,000–99,999
30–39      -        80,000–89,999
40–49      Calgary  110,000–119,999
30–39      -        110,000–119,999
30–39      -        90,000–99,999
30–39      -        90,000–99,999
30–39      -        80,000–89,999
30–39      -        100,000–109,999
40–49      Calgary  90,000–99,999
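Checking \(l\)-diversity can be sketched in R by counting distinct sensitive values within each quasi-identifier group. The toy data below are hypothetical:

```r
# Sketch of checking l-diversity: the dataset is l-diverse for
# l equal to the smallest number of distinct sensitive values
# found in any quasi-identifier group.
library(dplyr)

df <- tibble(
  age_range = c("30-39", "30-39", "40-49", "40-49"),
  salary    = c("90-99k", "90-99k", "110-119k", "90-99k")
)

l_counts <- df |>
  group_by(age_range) |>
  summarise(distinct_salaries = n_distinct(salary))

l <- min(l_counts$distinct_salaries)
# Here l = 1: the 30-39 group has only one salary range, so this
# toy dataset violates 2-diversity even if it is k-anonymous.
```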
Consider this 3-anonymous dataset. Is it also 2-diverse with respect to “Condition”?
Age Range  ZIP Prefix  Condition
20–29      130**       Flu
20–29      130**       Flu
20–29      130**       Flu
30–39      148**       Cold
30–39      148**       Cold
30–39      148**       Cancer
A. Yes, both groups have 2 or more different values
B. No, one group violates l-diversity
C. Yes, because the dataset is already k-anonymous
D. No, both groups have only one distinct value
There are still issues…
Even though the data is de-identified, some sensitive patterns can still leak through.
In the example we discussed, both individuals are grouped into the same age range and city.
While they are in different salary ranges and exact values are hidden, the range is still quite narrow.
Due to the similarity of the salary ranges, one can still infer that both individuals earn between $90,000 and $119,999.
Age Range  City     Salary Range
40–49      Calgary  110,000–119,999
40–49      Calgary  90,000–99,999
Differential privacy
So, we may need more sophisticated tools to privatize our data…
Differential privacy is a mathematical approach to protecting privacy.
It guarantees that an algorithm’s results are nearly the same whether or not any one person’s data is included.
This makes it hard to tell whether any individual’s data is in the dataset, protecting individuals’ information (even when their data is unusual or unique).
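One standard differential-privacy building block is the Laplace mechanism: add noise drawn from a Laplace distribution, scaled to the query’s sensitivity divided by the privacy budget \(\epsilon\). A minimal R sketch, where the count, sensitivity, and \(\epsilon\) values are assumptions for illustration:

```r
# Sketch of the Laplace mechanism for a count query.
# true_count, sensitivity, and epsilon are hypothetical.
true_count  <- 100   # e.g., number of people with some condition
sensitivity <- 1     # one person changes a count by at most 1
epsilon     <- 0.5   # privacy budget: smaller = more private

# Base R has no Laplace sampler; the difference of two iid
# exponentials with the same scale is Laplace-distributed.
rlaplace <- function(n, scale) {
  rexp(n, rate = 1 / scale) - rexp(n, rate = 1 / scale)
}

set.seed(123)
private_count <- true_count + rlaplace(1, scale = sensitivity / epsilon)
private_count
```

Anyone seeing only the noisy count cannot reliably tell whether any single person’s record was included.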